Three Learning Principles


Roadmap 
(1) When Can Machines Learn?
(2) Why Can Machines Learn?
(3) How Can Machines Learn?
(4) How Can Machines Learn Better?










Lecture 15: Validation 


(crossly) reserve validation data to simulate testing 
procedure for model selection 








Lecture 16: Three Learning Principles 
• Occam’s Razor
• Sampling Bias
• Data Snooping
• Power of Three


Occam's Razor 
An explanation of the data should be made as simple as
possible, but no simpler.
- Albert Einstein? (1879-1955)

entia non sunt multiplicanda praeter necessitatem
(entities must not be multiplied beyond necessity)
- William of Occam (1287-1347)


'Occam's razor' for trimming down 
unnecessary explanation 
figure by Fred the Oyster (Own work) [CC-BY-SA-3.0], via Wikimedia Commons 







Occam's Razor for Learning 





The simplest model that fits the data is also the most 
plausible. 











which one do you prefer? :-) 





two questions: 

(1) What does it mean for a model to be simple?
(2) How do we know that simpler is better?


Simple Model 














simple hypothesis $h$
• small $\Omega(h)$ = ‘looks’ simple
• specified by few parameters

simple model $\mathcal{H}$
• small $\Omega(\mathcal{H})$ = not many
• contains small number of hypotheses








$h$ specified by $\ell$ bits $\Leftarrow$ $|\mathcal{H}|$ of size $2^\ell$

small $\Omega(h)$ $\Leftarrow$ small $\Omega(\mathcal{H})$

simple: small hypothesis/model complexity


Simple is Better 


in addition to math proof that you have seen, philosophically:

simple $\mathcal{H}$
$\Rightarrow$ smaller $m_{\mathcal{H}}(N)$
$\Rightarrow$ less ‘likely’ to fit data perfectly: $m_{\mathcal{H}}(N) / 2^N$
$\Rightarrow$ more significant when fit happens

direct action: linear first;
always ask whether data over-modeled
Fun Time

Consider the decision stumps in $\mathbb{R}^1$ as the hypothesis set $\mathcal{H}$. Recall
that $m_{\mathcal{H}}(N) = 2N$. Consider 10 different inputs $x_1, x_2, \ldots, x_{10}$ coupled
with labels $y_n$ generated iid from a fair coin. What is the probability that
the data $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{10}$ is separable by $\mathcal{H}$?

(1) 1/1024
(2) 10/1024
(3) 20/1024
(4) 100/1024

Reference Answer: (3)

Of all 1024 possible $\mathcal{D}$, only $2N = 20$ of them
are separable by $\mathcal{H}$.
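
The count of 20 separable labelings can be checked by brute force. The sketch below is only an illustration: it assumes stumps of the form $h(x) = s \cdot \mathrm{sign}(x - \theta)$, and the input values and the function name `stump_dichotomies` are hypothetical choices, not part of the lecture.

```python
def stump_dichotomies(xs):
    """Collect every labeling that a 1-D stump s * sign(x - theta) produces on xs."""
    pts = sorted(xs)
    # Candidate thresholds: below all points, between neighbors, above all points.
    thetas = [pts[0] - 1.0]
    thetas += [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    thetas += [pts[-1] + 1.0]
    found = set()
    for theta in thetas:
        for s in (+1, -1):
            found.add(tuple(s * (1 if x > theta else -1) for x in xs))
    return found

xs = list(range(10))              # 10 distinct inputs (hypothetical values)
separable = stump_dichotomies(xs)
total = 2 ** len(xs)              # all equally likely fair-coin labelings
print(len(separable), total, len(separable) / total)
```

Two of the 22 (threshold, sign) pairs repeat the all-positive and all-negative labelings, which is why the set ends up with $2N = 20$ distinct members.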


Presidential Story 


• 1948 US presidential election: Truman versus Dewey
• a newspaper ran a phone poll of how people voted,
  and set the headline ‘Dewey Defeats Truman’ based on the polling

[figure: newspaper headline ‘DEWEY DEFEATS TRUMAN’]





who is this? :-) 


The Big Smile Came from ... 





Truman, and yes he won

suspect of the mistake:
• editorial bug? no
• bad luck of polling ($\delta$)? no





hint: phones were expensive :-) 


Sampling Bias 





If the data is sampled in a biased way, learning will
produce a similarly biased outcome.











• technical explanation:
  data from $P_1(x, y)$ but test under $P_2 \neq P_1$: VC fails
• philosophical explanation:
  study Math hard but test English: no strong test guarantee

‘minor’ VC assumption:
data and testing both iid from $P$
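
A small simulation can make the $P_1$ versus $P_2$ mismatch concrete. This is only a sketch under made-up assumptions: the target, the hypothesis `h`, and the two uniform distributions are hypothetical and chosen so that `h` looks perfect under $P_1$ but not under $P_2$.

```python
import random

def target(x):
    # Hypothetical target: positive only on (0, 1].
    return 1 if 0 < x <= 1 else -1

def h(x):
    # Fixed hypothesis: agrees with the target except where x > 1.
    return 1 if x > 0 else -1

def error_rate(sample):
    return sum(h(x) != target(x) for x in sample) / len(sample)

rng = random.Random(1126)
sample_p1 = [rng.uniform(-1, 1) for _ in range(10_000)]  # drawn from P_1
sample_p2 = [rng.uniform(0, 2) for _ in range(10_000)]   # drawn from P_2 != P_1

err_p1 = error_rate(sample_p1)
err_p2 = error_rate(sample_p2)
print(f"error under P_1: {err_p1:.3f}, error under P_2: {err_p2:.3f}")
```

The error measured under $P_1$ is zero because $P_1$ never generates the region where `h` is wrong, while about half of the $P_2$ draws fall there.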


Sampling Bias in Learning 


A True Personal Story 


• Netflix competition for movie
  recommender system:
  10% improvement = 1M US dollars
• formed $\mathcal{D}_{\text{val}}$;
  in my first shot,
  $E_{\text{val}}(g)$ showed 13% improvement
• why am I still teaching here? :-)








[figure: recommender diagram matching movie and viewer factors to give the predicted rating]








validation: random examples within D; 
test: ‘last’ user records ‘after’ D 







Dealing with Sampling Bias 





If the data is sampled in a biased way, learning will
produce a similarly biased outcome.











• practical rule of thumb:
  match test scenario as much as possible
• e.g. if test: ‘last’ user records ‘after’ $\mathcal{D}$
  • training: emphasize later examples (KDDCup 2011)
  • validation: use ‘late’ user records
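
One way to follow the ‘late’ user records idea is to hold out each user's most recent records for validation instead of sampling at random. The sketch below is a minimal illustration; the record fields (`user`, `time`, `rating`) and the function name `split_last_records` are hypothetical, not taken from the competition data.

```python
from collections import defaultdict

def split_last_records(records, k=1):
    """Hold out each user's k most recent records as validation; the rest is training."""
    by_user = defaultdict(list)
    for rec in records:
        by_user[rec["user"]].append(rec)
    train, val = [], []
    for recs in by_user.values():
        recs.sort(key=lambda r: r["time"])   # oldest first
        cut = max(len(recs) - k, 0)          # users with <= k records go fully to validation
        train.extend(recs[:cut])
        val.extend(recs[cut:])
    return train, val

records = [
    {"user": "u1", "time": 3, "rating": 5},
    {"user": "u1", "time": 1, "rating": 4},
    {"user": "u2", "time": 2, "rating": 3},
    {"user": "u1", "time": 2, "rating": 2},
    {"user": "u2", "time": 5, "rating": 1},
]
train, val = split_last_records(records, k=1)
print(sorted((r["user"], r["time"]) for r in val))
```

With this split, the validation examples come ‘after’ the training examples for every user, which mimics the test scenario described above.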






last puzzle: 


danger when learning ‘credit card approval’ 
with existing bank records? 
Fun Time

If the data $\mathcal{D}$ is an unbiased sample from the underlying distribution $P$
for binary classification, which of the following subsets of $\mathcal{D}$ is also an
unbiased sample from $P$?

(1) all the positive ($y_n > 0$) examples
(2) half of the examples that are randomly and uniformly picked from
    $\mathcal{D}$ without replacement
(3) half of the examples with the smallest $\|x_n\|$ values
(4) the largest subset that is linearly separable

Reference Answer: (2)

That's how we form the validation set,
remember? :-)


Visual Data Snooping 





Visualize $\mathcal{X} = \mathbb{R}^2$
• full $\Phi_2$: $z = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2)$, $d_{\text{VC}} = 6$
• or $z = (1, x_1^2, x_2^2)$, $d_{\text{VC}} = 3$, after visualizing?
• or better $z = (1, x_1^2 + x_2^2)$, $d_{\text{VC}} = 2$?
• or even better $z = (\mathrm{sign}(0.6 - x_1^2 - x_2^2))$?

careful about your brain’s ‘model complexity’

for VC-safety, $\Phi$ shall be
decided without ‘snooping’ data
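
The four candidate transforms can be written out to see how each one shrinks the feature vector. This is only a sketch; the function names are hypothetical. The lengths of the first three outputs equal the $d_{\text{VC}}$ values quoted on the slide, but the complexity paid by ‘visualizing’ the data first is not visible in the code.

```python
def phi_full(x1, x2):
    # Full second-order transform: 6 coordinates.
    return (1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2)

def phi_squares(x1, x2):
    # Keeps only the squared terms: 3 coordinates.
    return (1.0, x1 * x1, x2 * x2)

def phi_radius(x1, x2):
    # Uses the squared radius only: 2 coordinates.
    return (1.0, x1 * x1 + x2 * x2)

def phi_handmade(x1, x2):
    # A boundary picked by eye: a single coordinate, with 0.6 taken from looking at the data.
    return (1.0 if 0.6 - x1 * x1 - x2 * x2 > 0 else -1.0,)

transforms = (phi_full, phi_squares, phi_radius, phi_handmade)
lengths = [len(phi(0.5, -0.2)) for phi in transforms]
print(lengths)
```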


Data Snooping by Mere Shifting-Scaling 





If a data set has affected any step in the learning process,
its ability to assess the outcome has been compromised.











• 8 years of currency trading data
• first 6 years for training,
  last 2 years for testing
• x = previous 20 days,
  y = 21st day
• snooping versus no snooping:
  superior profit possible


[figure: cumulative profit (%) versus day, comparing snooping and no snooping]


• snooping: shift-scale all values by training + testing
• no snooping: shift-scale all values by training only
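
The difference between the two shift-scale choices is easy to see in code. The sketch below assumes shift-scale means subtracting a mean and dividing by a standard deviation; that reading, the toy series, and the helper names are assumptions for illustration.

```python
from statistics import mean, pstdev

def fit_shift_scale(values):
    # Statistics used for normalization: (shift, scale).
    return mean(values), pstdev(values)

def apply_shift_scale(values, shift, scale):
    return [(v - shift) / scale for v in values]

series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 10.0, 12.0]   # hypothetical data
train, test = series[:6], series[6:]

# No snooping: statistics come from the training part only.
shift, scale = fit_shift_scale(train)
train_z = apply_shift_scale(train, shift, scale)
test_z = apply_shift_scale(test, shift, scale)

# Snooping: statistics computed on training + testing leak test information.
snoop_shift, snoop_scale = fit_shift_scale(series)

print(shift, snoop_shift)
```

In the snooping variant the test values have already moved the shift and scale, so the test set has affected a step of the learning process.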


Data Snooping by Data Reusing 


Research Scenario 
benchmark data D 

• paper 1: propose $\mathcal{H}_1$ that works well on $\mathcal{D}$
• paper 2: find room for improvement, propose $\mathcal{H}_2$,
  and publish only if better than $\mathcal{H}_1$ on $\mathcal{D}$
• paper 3: find room for improvement, propose $\mathcal{H}_3$,
  and publish only if better than $\mathcal{H}_2$ on $\mathcal{D}$











• if all papers from the same author in one big paper:
  bad generalization due to $d_{\text{VC}}(\cup_m \mathcal{H}_m)$
• step-wise: later author snooped data by reading earlier papers,
  bad generalization worsened by ‘publish only if better’





if you torture the data long enough, it will confess :-)
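
A toy simulation shows how ‘publish only if better’ on one benchmark can manufacture progress. It is a sketch under deliberately extreme assumptions: the labels are coin flips and every ‘method’ is a random guesser, so no method has any real skill. The sizes `N` and `M` are arbitrary.

```python
import random

rng = random.Random(3)
N, M = 100, 200                                    # benchmark size, number of attempts
labels = [rng.choice((-1, 1)) for _ in range(N)]   # benchmark D with coin-flip labels

def accuracy(preds, ys):
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

best_acc = 0.0
published = []                                     # (attempt index, accuracy on D)
for m in range(M):
    preds = [rng.choice((-1, 1)) for _ in range(N)]    # a method with no real skill
    acc = accuracy(preds, labels)
    if acc > best_acc:                             # publish only if better on D
        best_acc = acc
        published.append((m, acc))

# On fresh data, a skill-free method is still a coin flip.
fresh_labels = [rng.choice((-1, 1)) for _ in range(10_000)]
fresh_preds = [rng.choice((-1, 1)) for _ in range(10_000)]
fresh_acc = accuracy(fresh_preds, fresh_labels)

print(f"best accuracy on D: {best_acc:.2f}, accuracy on fresh data: {fresh_acc:.3f}")
```

The published record improves step by step on $\mathcal{D}$, yet the accuracy on fresh data stays near one half.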


Dealing with Data Snooping 


• truth: very hard to avoid, unless being extremely honest
• extremely honest: lock your test data in safe
• less honest: reserve validation and use cautiously

• be blind: avoid making modeling decision by data
• be suspicious: interpret research results (including your own) by
  proper feeling of contamination





one secret to winning KDDCups: 


careful balance between 
data-driven modeling (snooping) and 
validation (no-snooping) 





Fun Time

Which of the following can result in unsatisfactory test performance in
machine learning?

(1) data snooping
(2) overfitting
(3) sampling bias
(4) all of the above

Reference Answer: (4)

A professional like you should be aware of
those! :-)


Three Related Fields 


Data Mining
• use (huge) data to find property that is interesting
• difficult to distinguish ML and DM in reality

Artificial Intelligence
• compute something that shows intelligent behavior
• ML is one possible route to realize AI

Statistics
• use data to make inference about an unknown process
• statistics contains many useful tools for ML


Three Theoretical Bounds 


Hoeffding
$P[\text{BAD}] \le 2 \exp(-2\epsilon^2 N)$
• one hypothesis
• useful for verifying/testing

Multi-Bin Hoeffding
$P[\text{BAD}] \le 2M \exp(-2\epsilon^2 N)$
• $M$ hypotheses
• useful for validation

VC
$P[\text{BAD}] \le 4 m_{\mathcal{H}}(2N) \exp(\ldots)$
• all $\mathcal{H}$
• useful for training
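
The first two bounds are simple enough to evaluate numerically. The sketch below computes them and the smallest $N$ that pushes the bound under a tolerance $\delta$; the function names and the example values of $\epsilon$, $\delta$, and $M$ are hypothetical. The VC bound is left out because its exponent is elided on the slide.

```python
import math

def hoeffding_bound(eps, n):
    # P[BAD] <= 2 exp(-2 eps^2 N) for one hypothesis.
    return 2 * math.exp(-2 * eps ** 2 * n)

def multi_bin_bound(eps, n, m):
    # P[BAD] <= 2 M exp(-2 eps^2 N) for M hypotheses (union bound).
    return m * hoeffding_bound(eps, n)

def sample_size_for(eps, delta, m=1):
    """Smallest N with 2 m exp(-2 eps^2 N) <= delta."""
    return math.ceil(math.log(2 * m / delta) / (2 * eps ** 2))

n_test = sample_size_for(eps=0.1, delta=0.05)          # one hypothesis (testing)
n_val = sample_size_for(eps=0.1, delta=0.05, m=100)    # 100 candidates (validation)
print(n_test, n_val)
```

Going from one hypothesis to $M$ candidates only adds a $\ln M$ term to the required $N$, which is why validation over a modest number of choices stays affordable.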


Three Linear Models 


PLA/pocket
$h(x) = \mathrm{sign}(s)$
plausible err = 0/1
(small flipping noise)
minimize specially

linear regression
$h(x) = s$
friendly err = squared
(easy to minimize)
minimize analytically

logistic regression
$h(x) = \theta(s)$
plausible err = CE
(maximum likelihood)
minimize iteratively
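
All three models share the same linear score $s = \mathbf{w}^T \mathbf{x}$ and differ only in the output function and the error measure. The sketch below writes them side by side, assuming labels $y \in \{-1, +1\}$ and the cross-entropy form $\ln(1 + \exp(-ys))$; the function names and the example weights are hypothetical.

```python
import math

def score(w, x):
    # Shared linear score s = w^T x.
    return sum(wi * xi for wi, xi in zip(w, x))

def h_classification(w, x):      # PLA/pocket: sign(s)
    return 1 if score(w, x) > 0 else -1

def h_linear_regression(w, x):   # linear regression: s
    return score(w, x)

def h_logistic(w, x):            # logistic regression: theta(s)
    return 1.0 / (1.0 + math.exp(-score(w, x)))

def err_01(s, y):
    return float((1 if s > 0 else -1) != y)

def err_squared(s, y):
    return (s - y) ** 2

def err_cross_entropy(s, y):
    return math.log(1.0 + math.exp(-y * s))

w = (0.5, 1.0, -1.0)
x = (1.0, 2.0, 0.5)              # first coordinate is the constant 1
s = score(w, x)
print(s, h_classification(w, x), h_linear_regression(w, x), h_logistic(w, x))
print(err_01(s, +1), err_squared(s, +1), err_cross_entropy(s, +1))
```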


Three Key Tools 
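Feature Transform
$E_{\text{in}}(\mathbf{w}) \rightarrow E_{\text{in}}(\tilde{\mathbf{w}})$
$d_{\text{VC}}(\mathcal{H}) \rightarrow d_{\text{VC}}(\mathcal{H}_\Phi)$
• by using more complicated $\Phi$
• lower $E_{\text{in}}$
• higher $d_{\text{VC}}$

Regularization
$E_{\text{in}}(\mathbf{w}) \rightarrow E_{\text{in}}(\mathbf{w}_{\text{REG}})$
$d_{\text{VC}}(\mathcal{H}) \rightarrow d_{\text{EFF}}(\mathcal{H}, \mathcal{A})$
• by augmenting regularizer $\Omega$
• lower $d_{\text{EFF}}$
• higher $E_{\text{in}}$

Validation
$E_{\text{in}}(h) \rightarrow E_{\text{val}}(h)$
$\mathcal{H} \rightarrow \{g_1^-, \ldots, g_M^-\}$
• by reserving $K$ examples as $\mathcal{D}_{\text{val}}$
• fewer choices
• fewer examples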







Three Learning Principles 
Occam's Razor
simple is good

Sampling Bias
class matches exam

Data Snooping
honesty is best policy







Three Future Directions 


More Transform
More Regularization
Less Label

bagging, decision tree, support vector machine, neural network, kernel,
AdaBoost, aggregation, sparsity, autoencoder, coordinate descent,
dual, uniform blending, deep learning, nearest neighbor, decision stump,
kernel LogReg, large-margin, prototype, quadratic programming, SVR,
GBDT, PCA, random forest, matrix factorization, Gaussian kernel,
k-means, OOB error, RBF network, probabilistic SVM, soft-margin

ready for the jungle!





Fun Time

What are the magic numbers that repeatedly appear in this class?

(1) 3
(2) 1126
(3) both 3 and 1126
(4) neither 3 nor 1126

Reference Answer: (3)

3 as illustrated, and you may recall 1126
somewhere :-)




Summary 
(1) When Can Machines Learn?
(2) Why Can Machines Learn?
(3) How Can Machines Learn?
(4) How Can Machines Learn Better?


Lecture 15: Validation 






Lecture 16: Three Learning Principles 
• Occam’s Razor
  simple, simple, simple!
• Sampling Bias
  match test scenario as much as possible
• Data Snooping
  any use of data is ‘contamination’
• Power of Three
  relatives, bounds, models, tools, principles

• next: ready for jungle!